import altair as alt
import pandas as pd
Here, we are going to work with the Vancouver street trees data set. I chose a smaller version of the data that contains only 5,000 rows. Let's import the data and look at the first few rows, and then I will start the exploratory data analysis.
trees_df = pd.read_csv(
"https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_vancouver_trees.csv",
parse_dates=["date_planted"],
)
trees_df.head()
| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19886 | W 10TH AV | W 10TH AV | BIGNONIOIDES | Kitsilano | NaT | 34.0 | ODD | CATALPA | N | ... | 10 | Y | 9945 | COMMON CATALPA | 5 | 3200 | NaN | N | 49.263400 | -123.177100 |
| 1 | 7941 | W 59TH AV | W 59TH AV | SACCHARINUM | Marpole | NaT | 20.0 | ODD | ACER | Y | ... | 16 | Y | 50427 | SILVER MAPLE | 4 | 700 | NaN | N | 49.217059 | -123.120787 |
| 2 | 4613 | W 47TH AV | W 47TH AV | PLATANOIDES | Kerrisdale | NaT | 24.0 | ODD | ACER | N | ... | 12 | Y | 43456 | NORWAY MAPLE | 5 | 2200 | NaN | N | 49.229119 | -123.159841 |
| 3 | 7388 | COMMERCIAL DRIVE | COMMERCIAL DRIVE | EUCHLORA X | Grandview-Woodland | NaT | 8.0 | EVEN | TILIA | N | ... | C | Y | 69099 | CRIMEAN LINDEN | 3 | 1300 | NaN | N | 49.272647 | -123.069463 |
| 4 | 1894 | E 55TH AV | E 55TH AV | SPECIES | Victoria-Fraserview | NaT | 14.0 | EVEN | ABIES | N | ... | B | Y | 164752 | CRIMSON SUNSET NORWAY MAPLE | 5 | 1900 | NaN | N | 49.219958 | -123.067159 |
5 rows × 21 columns
trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Unnamed: 0 | 5000 non-null | int64 |
| 1 | std_street | 5000 non-null | object |
| 2 | on_street | 5000 non-null | object |
| 3 | species_name | 5000 non-null | object |
| 4 | neighbourhood_name | 5000 non-null | object |
| 5 | date_planted | 2338 non-null | datetime64[ns] |
| 6 | diameter | 5000 non-null | float64 |
| 7 | street_side_name | 5000 non-null | object |
| 8 | genus_name | 5000 non-null | object |
| 9 | assigned | 5000 non-null | object |
| 10 | civic_number | 5000 non-null | int64 |
| 11 | plant_area | 4963 non-null | object |
| 12 | curb | 5000 non-null | object |
| 13 | tree_id | 5000 non-null | int64 |
| 14 | common_name | 5000 non-null | object |
| 15 | height_range_id | 5000 non-null | int64 |
| 16 | on_street_block | 5000 non-null | int64 |
| 17 | cultivar_name | 2700 non-null | object |
| 18 | root_barrier | 5000 non-null | object |
| 19 | latitude | 5000 non-null | float64 |
| 20 | longitude | 5000 non-null | float64 |
dtypes: datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 820.4+ KB
To answer these questions, I only need the following columns. I kept date_planted in the dataframe for now; however, I won't use it, since more than half of the dates are missing.
trees_df = trees_df[
[
"on_street",
"species_name",
"neighbourhood_name",
"date_planted",
"diameter",
"genus_name",
"common_name",
"height_range_id",
"root_barrier",
]
]
trees_df
| | on_street | species_name | neighbourhood_name | date_planted | diameter | genus_name | common_name | height_range_id | root_barrier |
|---|---|---|---|---|---|---|---|---|---|
| 0 | W 10TH AV | BIGNONIOIDES | Kitsilano | NaT | 34.0 | CATALPA | COMMON CATALPA | 5 | N |
| 1 | W 59TH AV | SACCHARINUM | Marpole | NaT | 20.0 | ACER | SILVER MAPLE | 4 | N |
| 2 | W 47TH AV | PLATANOIDES | Kerrisdale | NaT | 24.0 | ACER | NORWAY MAPLE | 5 | N |
| 3 | COMMERCIAL DRIVE | EUCHLORA X | Grandview-Woodland | NaT | 8.0 | TILIA | CRIMEAN LINDEN | 3 | N |
| 4 | E 55TH AV | SPECIES | Victoria-Fraserview | NaT | 14.0 | ABIES | CRIMSON SUNSET NORWAY MAPLE | 5 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | E 6TH AV | MORDENSIS | Mount Pleasant | 2011-11-02 | 3.0 | CRATAEGUS | TOBA HAWTHORN | 1 | N |
| 4996 | E 22ND AV | PSEUDOPLATANUS | Kensington-Cedar Cottage | NaT | 12.5 | ACER | SYCAMORE MAPLE | 4 | N |
| 4997 | WILLOW ST | OXYACANTHA | Fairview | NaT | 20.0 | CRATAEGUS | ENGLISH HAWTHORN | 3 | N |
| 4998 | W 19TH AV | XX | Riley Park | 2017-05-15 | 3.0 | MAGNOLIA | MAGNOLIA 'GALAXY' | 1 | N |
| 4999 | GRANVILLE ST | SYLVATICA | Downtown | 2009-10-29 | 3.0 | FAGUS | EUROPEAN BEECH | 1 | N |
5000 rows × 9 columns
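The decision to set date_planted aside rests on how much of it is missing. Below is a minimal sketch of that check. It runs on a small made-up frame rather than the real data: only the column names come from the data set, and the values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for trees_df; the values are invented for illustration only.
toy_df = pd.DataFrame(
    {
        "date_planted": pd.to_datetime(
            ["2011-11-02", None, None, "2017-05-15", None]
        ),
        "diameter": [3.0, 12.5, 20.0, 3.0, 8.0],
    }
)

# isna() flags missing entries; the mean of the booleans is the missing share.
missing_share = toy_df.isna().mean()
print(missing_share)
```

On the real frame, the info() output above reports 2338 non-null dates out of 5000 rows, so roughly 53% of the planting dates are missing.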
trees_df.describe()
| | diameter | height_range_id |
|---|---|---|
| count | 5000.000000 | 5000.000000 |
| mean | 12.132900 | 2.699800 |
| std | 9.310923 | 1.550923 |
| min | 0.250000 | 0.000000 |
| 25% | 4.250000 | 2.000000 |
| 50% | 10.000000 | 2.000000 |
| 75% | 17.000000 | 4.000000 |
| max | 182.000000 | 9.000000 |
trees_df.describe(exclude="number", datetime_is_numeric=True)
| | on_street | species_name | neighbourhood_name | date_planted | genus_name | common_name | root_barrier |
|---|---|---|---|---|---|---|---|
| count | 5000 | 5000 | 5000 | 2338 | 5000 | 5000 | 5000 |
| unique | 607 | 157 | 22 | NaN | 62 | 339 | 2 |
| top | W KING EDWARD AV | SERRULATA | Kensington-Cedar Cottage | NaN | ACER | KWANZAN FLOWERING CHERRY | N |
| freq | 59 | 464 | 441 | NaN | 1277 | 363 | 4662 |
| mean | NaN | NaN | NaN | 2003-10-10 07:57:19.863131008 | NaN | NaN | NaN |
| min | NaN | NaN | NaN | 1989-11-15 00:00:00 | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | 1997-12-11 06:00:00 | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | 2003-04-10 12:00:00 | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | 2009-11-06 00:00:00 | NaN | NaN | NaN |
| max | NaN | NaN | NaN | 2019-04-16 00:00:00 | NaN | NaN | NaN |
Let's first take a look at which columns are categorical and which ones are numerical.
categorical_columns = trees_df.select_dtypes("object").columns.tolist()
categorical_columns
['on_street', 'species_name', 'neighbourhood_name', 'genus_name', 'common_name', 'root_barrier']
numerical_columns = trees_df.select_dtypes("number").columns.tolist()
numerical_columns
['diameter', 'height_range_id']
Now we can start answering the questions we posed at the beginning of this notebook.
neighbourhood_trees = (
alt.Chart(trees_df)
.mark_bar()
.encode(
alt.X("count()", title="Count of trees planted"),
alt.Y("neighbourhood_name", sort="x", title="Neighbourhood"),
)
).properties(title="Neighbourhood tree counts")
neighbourhood_trees
We can tell from the bar chart above that Kensington-Cedar Cottage, Renfrew-Collingwood, and Hastings-Sunrise are the top three neighbourhoods in terms of the number of trees planted.
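The ranking read off the bar chart can also be checked numerically. Here is a minimal sketch, assuming a toy frame with invented rows (only the column name neighbourhood_name is taken from the data set):

```python
import pandas as pd

# Toy stand-in for trees_df; the rows are invented for illustration only.
toy_df = pd.DataFrame(
    {
        "neighbourhood_name": [
            "Kitsilano",
            "Marpole",
            "Kitsilano",
            "Fairview",
            "Kitsilano",
            "Marpole",
        ]
    }
)

# value_counts() sorts by frequency in descending order, so head(3) is the top three.
top_counts = toy_df["neighbourhood_name"].value_counts().head(3)
print(top_counts)
```

Running the same two calls on trees_df should agree with the describe() output shown earlier, which lists Kensington-Cedar Cottage as the most frequent neighbourhood with 441 trees.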
tree_size_plot_scatter = (
alt.Chart(trees_df)
.mark_circle()
.encode(alt.X("diameter"), alt.Y("height_range_id"))
)
tree_size_plot_line = (
alt.Chart(trees_df)
.mark_line(color = 'Green')
.encode(alt.X("mean(diameter)"), alt.Y("height_range_id"))
)
tree_size_plot_scatter + tree_size_plot_line
Using only the mean diameter to answer this question can hide how the diameters are spread within each height range. So I decided to show both a scatter plot of all the diameter points and a line plot of the mean diameter. I can see that there is one outlier point. I am going to remove it and repeat the chart to get a better understanding.
tree_size_plot_scatter = (
alt.Chart(trees_df[trees_df["diameter"] < 80])
.mark_circle()
.encode(alt.X("diameter", title = "Diameter"), alt.Y("height_range_id"))
)
tree_size_plot_line = (
alt.Chart(trees_df[trees_df["diameter"] < 80])
.mark_line(color="green")
.encode(
alt.X("mean(diameter)", title="Mean of diameter"),
alt.Y("height_range_id", title="Height range"),
)
)
tree_size_plot_scatter + tree_size_plot_line
From this plot, we can tell that taller trees have, on average, a bigger diameter. However, the scatter plot also shows a good number of tall trees with a small diameter.
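A mean is sensitive to extreme values, so it matters whether the outlier is dropped before averaging. The sketch below compares the per-height-range mean with and without the same `diameter < 80` filter used for the scatter plot. The toy values are assumptions made for illustration, apart from the 182 maximum, which appears in the describe() output.

```python
import pandas as pd

# Toy stand-in for trees_df; one extreme diameter mimics the outlier.
toy_df = pd.DataFrame(
    {
        "diameter": [3.0, 5.0, 10.0, 14.0, 20.0, 182.0],
        "height_range_id": [1, 1, 2, 2, 4, 4],
    }
)

# Same cut-off as in the chart code above.
filtered = toy_df[toy_df["diameter"] < 80]

# Mean diameter per height range, before and after removing the outlier.
mean_all = toy_df.groupby("height_range_id")["diameter"].mean()
mean_filtered = filtered.groupby("height_range_id")["diameter"].mean()
print(pd.DataFrame({"all": mean_all, "filtered": mean_filtered}))
```

In this toy example only the height range that contains the outlier changes, which is why the mean line and the scatter plot should be drawn from the same filtered frame.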
Calculating the correlation shows a positive relationship between these two columns.
corr_df = (
trees_df[numerical_columns].corr("pearson").stack().reset_index(name="correlation")
)
corr_df
| | level_0 | level_1 | correlation |
|---|---|---|---|
| 0 | diameter | diameter | 1.000000 |
| 1 | diameter | height_range_id | 0.752331 |
| 2 | height_range_id | diameter | 0.752331 |
| 3 | height_range_id | height_range_id | 1.000000 |
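The long-format table above repeats each pair twice and includes the trivial self-correlations. A small sketch of one way to tidy it follows, using toy values that are assumptions and not the real data. It also computes a Spearman rank correlation, which is a technique not used in the analysis above but is a common option when one column is an ordinal code such as height_range_id.

```python
import pandas as pd

# Toy stand-in for the two numerical columns; values are invented.
toy_df = pd.DataFrame(
    {
        "diameter": [3.0, 8.0, 12.0, 20.0, 34.0],
        "height_range_id": [1, 2, 3, 4, 5],
    }
)

# Name the two index levels so reset_index() does not produce level_0/level_1.
corr_long = (
    toy_df.corr(method="pearson")
    .stack()
    .rename_axis(["var_1", "var_2"])
    .reset_index(name="correlation")
)

# Drop the self-pairs, whose correlation is always 1.
off_diagonal = corr_long[corr_long["var_1"] != corr_long["var_2"]]
print(off_diagonal)

# Rank-based alternative (an addition for illustration, not part of the original analysis).
spearman = toy_df.corr(method="spearman").loc["diameter", "height_range_id"]
print(spearman)
```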
Now let's explore flowering cherry trees. These trees are beautiful in spring, so photographers and tourists can use these locations. Here I am going to answer question 3.
cherry_trees = trees_df[trees_df["common_name"] == "KWANZAN FLOWERING CHERRY"]
cherry_trees
| | on_street | species_name | neighbourhood_name | date_planted | diameter | genus_name | common_name | height_range_id | root_barrier |
|---|---|---|---|---|---|---|---|---|---|
| 6 | BROUGHTON ST | SERRULATA | West End | NaT | 24.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 3 | N |
| 14 | NASSAU DRIVE | SERRULATA | Victoria-Fraserview | NaT | 16.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 3 | N |
| 46 | W 11TH AV | SERRULATA | Fairview | NaT | 17.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 2 | N |
| 61 | E 23RD AV | SERRULATA | Riley Park | NaT | 26.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 3 | N |
| 90 | E 28TH AV | SERRULATA | Renfrew-Collingwood | NaT | 38.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 3 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4928 | E 21ST AV | SERRULATA | Kensington-Cedar Cottage | NaT | 24.5 | PRUNUS | KWANZAN FLOWERING CHERRY | 2 | N |
| 4962 | ALBERTA ST | SERRULATA | Oakridge | NaT | 19.5 | PRUNUS | KWANZAN FLOWERING CHERRY | 2 | N |
| 4976 | PARKER ST | SERRULATA | Grandview-Woodland | NaT | 29.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 3 | N |
| 4981 | W 20TH AV | SERRULATA | Arbutus-Ridge | NaT | 10.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 2 | N |
| 4987 | FLEMING ST | SERRULATA | Victoria-Fraserview | NaT | 12.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 3 | N |
363 rows × 9 columns
title = alt.TitleParams(
"Mount Pleasant has the most cherry trees",
subtitle="Downtown Vancouver has the fewest cherry trees",
)
neighbourhood_cherry = (
alt.Chart(cherry_trees, title=title)
.mark_bar()
.encode(
alt.X("count()"), alt.Y("neighbourhood_name", sort="x", title="Neighbourhood")
)
)
neighbourhood_cherry
neighbourhood_cherry = (
alt.Chart(cherry_trees, height=250, width=150)
.mark_bar()
.encode(
alt.X("count()", title = ""),
alt.Y("neighbourhood_name", sort="x"),
color=alt.Color("height_range_id"),
tooltip="count()",
)
.facet("height_range_id")
.properties(title="Cherry tree counts by neighbourhood and height range")
)
neighbourhood_cherry
Five neighbourhoods have cherry trees in height range 4: Mount Pleasant, Dunbar-Southlands, Kerrisdale, Fairview, and West Point Grey. However, each of these neighbourhoods has fewer than 5 such trees. We can also see that Victoria-Fraserview has 19 cherry trees in height range 3.
(
alt.Chart(cherry_trees)
.mark_tick()
.encode(alt.X("diameter"), alt.Y("height_range_id"))
)
From the plot above, we can tell that cherry trees with a diameter bigger than 25 are among the taller trees.
(
alt.Chart(cherry_trees)
.transform_density(
"diameter", groupby=["height_range_id"], as_=["diameter", "density"]
)
.mark_area()
.encode(x="diameter", y="density:Q", color="height_range_id")
)
We can tell that the most common diameter differs between height ranges among cherry trees. For example, the most common diameter for the shortest cherry trees is 5 inches, whereas for the tallest cherry trees it is about 32 inches.
However, this density plot also shows that there are not enough examples of trees in height range 4 to draw an accurate conclusion, since the density curve seems to be cut off at both ends.
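Whether a group has enough observations for a density estimate can be checked by counting rows per height range. This is a minimal sketch on invented values. The threshold `min_count` is a hypothetical choice for illustration, not a rule taken from the analysis.

```python
import pandas as pd

# Toy stand-in for cherry_trees; values are invented for illustration only.
toy_cherry = pd.DataFrame(
    {
        "height_range_id": [1, 2, 2, 2, 3, 3, 4],
        "diameter": [5.0, 10.0, 17.0, 19.5, 24.0, 29.0, 32.0],
    }
)

# Number of trees in each height range.
group_sizes = toy_cherry.groupby("height_range_id").size()

# Hypothetical threshold: flag groups with too few rows to trust a density curve.
min_count = 2
sparse = group_sizes[group_sizes < min_count].index.tolist()
print(group_sizes)
print(sparse)
```

Applied to cherry_trees, the same count would show how few trees sit in height range 4 before the density plot is interpreted.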
diameter_order = (
cherry_trees.groupby("neighbourhood_name")["diameter"]
.median()
.sort_values()
.index.tolist()
)
box = (
alt.Chart(cherry_trees)
.mark_boxplot()
.encode(alt.X("diameter:Q"), alt.Y("neighbourhood_name:N", sort=diameter_order))
.properties(title="Cherry tree diameter by neighbourhood")
)
bar = (
alt.Chart(cherry_trees)
.mark_bar()
.encode(
alt.X("diameter:Q"),
alt.Y("neighbourhood_name:N", sort=diameter_order),
tooltip="diameter",
)
.properties(title="Cherry tree diameter by neighbourhood")
)
box | bar
We can tell from the plots above that Killarney has the thickest trees, both in terms of the median diameter and the number of thick trees. From the bar chart, or by hovering the mouse over the box plot, the maximum diameter for this neighbourhood is 34. The bar chart shows that Victoria-Fraserview has trees whose diameter reaches 46. However, the box plot shows that the median tree diameter in this neighbourhood is lower than in Killarney. What makes this neighbourhood show a longer bar in the bar chart is a few trees that went above 30 inches in diameter.
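The difference between what the box plot and the bar chart emphasize comes down to median versus maximum. Here is a small sketch of that comparison. The two maxima (34 and 46) are the values quoted above. All other numbers are invented for illustration.

```python
import pandas as pd

# Toy stand-in for cherry_trees; only the two maxima come from the text above.
toy_cherry = pd.DataFrame(
    {
        "neighbourhood_name": ["Killarney"] * 3 + ["Victoria-Fraserview"] * 4,
        "diameter": [28.0, 30.0, 34.0, 12.0, 16.0, 20.0, 46.0],
    }
)

# Median describes the typical tree; max is driven by a single large tree.
summary = toy_cherry.groupby("neighbourhood_name")["diameter"].agg(
    ["median", "max", "count"]
)
print(summary)
```

In this toy data, one neighbourhood has the higher median while the other has the higher maximum, which is the same pattern described for Killarney and Victoria-Fraserview.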
common_trees = (
trees_df["common_name"]
.value_counts()[:10]
.sort_values(ascending=False)
.reset_index()
)
common_trees
| | index | common_name |
|---|---|---|
| 0 | KWANZAN FLOWERING CHERRY | 363 |
| 1 | PISSARD PLUM | 301 |
| 2 | NORWAY MAPLE | 219 |
| 3 | CRIMEAN LINDEN | 151 |
| 4 | BOWHALL RED MAPLE | 105 |
| 5 | NIGHT PURPLE LEAF PLUM | 98 |
| 6 | HEDGE MAPLE | 93 |
| 7 | KOBUS MAGNOLIA | 93 |
| 8 | RED MAPLE | 92 |
| 9 | PYRAMIDAL EUROPEAN HORNBEAM | 85 |
common_trees_df = trees_df[trees_df["common_name"].isin(common_trees["index"])]
common_trees_df
| | on_street | species_name | neighbourhood_name | date_planted | diameter | genus_name | common_name | height_range_id | root_barrier |
|---|---|---|---|---|---|---|---|---|---|
| 2 | W 47TH AV | PLATANOIDES | Kerrisdale | NaT | 24.0 | ACER | NORWAY MAPLE | 5 | N |
| 3 | COMMERCIAL DRIVE | EUCHLORA X | Grandview-Woodland | NaT | 8.0 | TILIA | CRIMEAN LINDEN | 3 | N |
| 5 | ADERA ST | CERASIFERA | Kerrisdale | NaT | 1.0 | PRUNUS | PISSARD PLUM | 2 | N |
| 6 | BROUGHTON ST | SERRULATA | West End | NaT | 24.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 3 | N |
| 7 | CHURCHILL ST | CERASIFERA | Shaughnessy | NaT | 9.0 | PRUNUS | PISSARD PLUM | 2 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4976 | PARKER ST | SERRULATA | Grandview-Woodland | NaT | 29.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 3 | N |
| 4981 | W 20TH AV | SERRULATA | Arbutus-Ridge | NaT | 10.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 2 | N |
| 4983 | W 22ND AV | PLATANOIDES | Dunbar-Southlands | NaT | 25.0 | ACER | NORWAY MAPLE | 6 | N |
| 4985 | W 10TH AV | PLATANOIDES | West Point Grey | NaT | 19.0 | ACER | NORWAY MAPLE | 5 | N |
| 4987 | FLEMING ST | SERRULATA | Victoria-Fraserview | NaT | 12.0 | PRUNUS | KWANZAN FLOWERING CHERRY | 3 | N |
1600 rows × 9 columns
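The column names produced by `value_counts().reset_index()` depend on the pandas version: older releases yield `index` plus the original column name, as in the output above, while pandas 2.x yields the original column name plus `count`. A sketch of a selection that avoids relying on those names is shown below. The toy rows and the `top_n` value are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for trees_df; rows are invented for illustration only.
toy_df = pd.DataFrame(
    {
        "common_name": [
            "NORWAY MAPLE",
            "PISSARD PLUM",
            "NORWAY MAPLE",
            "RED MAPLE",
            "PISSARD PLUM",
            "NORWAY MAPLE",
            "HEDGE MAPLE",
        ]
    }
)

top_n = 2  # the notebook uses 10; 2 keeps the toy example small

# The index of value_counts() holds the names regardless of pandas version.
top_names = toy_df["common_name"].value_counts().head(top_n).index

# Keep only the rows that belong to the most common trees.
top_df = toy_df[toy_df["common_name"].isin(top_names)]
print(top_df)
```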
Let's explore this new data frame that I made.
common_trees_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1600 entries, 2 to 4987
Data columns (total 9 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | on_street | 1600 non-null | object |
| 1 | species_name | 1600 non-null | object |
| 2 | neighbourhood_name | 1600 non-null | object |
| 3 | date_planted | 471 non-null | datetime64[ns] |
| 4 | diameter | 1600 non-null | float64 |
| 5 | genus_name | 1600 non-null | object |
| 6 | common_name | 1600 non-null | object |
| 7 | height_range_id | 1600 non-null | int64 |
| 8 | root_barrier | 1600 non-null | object |
dtypes: datetime64[ns](1), float64(1), int64(1), object(6)
memory usage: 125.0+ KB
common_trees_df.describe()
| | diameter | height_range_id |
|---|---|---|
| count | 1600.000000 | 1600.000000 |
| mean | 13.891719 | 2.812500 |
| std | 7.988612 | 1.343399 |
| min | 0.250000 | 1.000000 |
| 25% | 7.500000 | 2.000000 |
| 50% | 13.000000 | 3.000000 |
| 75% | 19.000000 | 4.000000 |
| max | 46.000000 | 9.000000 |
Let's first find the categorical and numerical columns in the common trees dataframe.
categorical_columns = common_trees_df.select_dtypes("object").columns.tolist()
categorical_columns
['on_street', 'species_name', 'neighbourhood_name', 'genus_name', 'common_name', 'root_barrier']
numerical_columns = common_trees_df.select_dtypes("number").columns.tolist()
numerical_columns
['diameter', 'height_range_id']
Now we can use this information to answer question 8 and visualize the distributions of all numerical columns in the common trees dataframe. This will help us understand the data better.
(
alt.Chart(common_trees_df)
.mark_bar()
.encode(
alt.X(alt.repeat(), type="quantitative", bin=alt.Bin(maxbins=25)),
alt.Y("count()"),
)
.properties(width=250, height=150)
.repeat(numerical_columns, columns=4)
)
The diameter plot has at least two peaks. Most of the trees have a diameter between 2 and 4 inches and are in height range 2 or 3.
(
alt.Chart(common_trees_df)
.mark_rect()
.encode(
alt.X("diameter", bin=alt.Bin(maxbins=30)),
alt.Y("height_range_id", bin=alt.Bin(maxbins=30)),
alt.Color("count()", title=None),
)
.properties(width=350, height=350)
)
From the heat map above, the most frequent combination of height and diameter among popular trees in Vancouver is a diameter between 2 and 4 and height range id 1.
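The most frequent cell of a heat map can also be found with a grouped count. The sketch below bins diameter with `pd.cut` and counts each (bin, height range) pair. The toy values and the 2-inch bin edges are assumptions, and the bins Altair picks for `maxbins=30` may differ.

```python
import pandas as pd

# Toy stand-in for common_trees_df; values are invented for illustration only.
toy_df = pd.DataFrame(
    {
        "diameter": [2.5, 3.0, 3.5, 3.0, 10.0, 13.0, 19.0, 24.0],
        "height_range_id": [1, 1, 1, 2, 2, 3, 4, 4],
    }
)

# Hypothetical 2-inch bins: (0, 2], (2, 4], ..., (48, 50].
bins = list(range(0, 52, 2))
toy_df["diameter_bin"] = pd.cut(toy_df["diameter"], bins=bins)

# Count rows in each observed (diameter bin, height range) combination.
combo_counts = toy_df.groupby(
    ["diameter_bin", "height_range_id"], observed=True
).size()

# idxmax() returns the (bin, height range) pair with the highest count.
most_frequent = combo_counts.idxmax()
print(most_frequent, combo_counts.max())
```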
I am hoping to get a better understanding of the most frequent species, tree name, and genus among all popular trees by answering this question.
tree_category_plot = (
alt.Chart(common_trees_df, height=250, width=300)
.mark_bar()
.encode(alt.X("count()"), alt.Y(alt.repeat(), type="nominal", sort="x"))
.properties(width=250)
.repeat(categorical_columns[1:], columns=2)
)
tree_category_plot
From these repeated plots, we can tell that Cerasifera is the most common species, Kwanzan flowering cherry is the most common tree, and Prunus is the most common genus. Renfrew-Collingwood has the largest number of popular trees in Vancouver.
Answering this question will help to better understand how height and diameter change across different species and genera of trees, as well as across neighbourhoods.
diameter_order = []
for groupby_col in ["species_name", "neighbourhood_name", "genus_name", "common_name"]:
    # Select diameter before aggregating so median() only sees a numeric column.
    diameter_order.extend(
        common_trees_df.groupby(groupby_col)["diameter"]
        .median()
        .sort_values()
        .index.to_list()
    )
# diameter_order
(
alt.Chart(common_trees_df)
.mark_boxplot()
.encode(
alt.X(alt.repeat("column"), type="quantitative"),
alt.Y(alt.repeat("row"), type="nominal", sort=diameter_order),
)
.properties(width=350, height=350)
.repeat(column=numerical_columns, row=categorical_columns[1:4])
)
This exploration of categorical and numerical columns leads to very interesting results. Among the species, Platanoides has the largest median diameter and median height. The median tree thickness is largest in the Marpole neighbourhood.
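The cell above concatenates the sort orders of all categorical columns into one list. When the orders need to be inspected one column at a time, a dictionary keyed by column name is one alternative. This is a sketch on invented rows, not the layout used in the notebook.

```python
import pandas as pd

# Toy stand-in for common_trees_df; rows are invented for illustration only.
toy_df = pd.DataFrame(
    {
        "genus_name": ["ACER", "ACER", "PRUNUS", "PRUNUS", "TILIA"],
        "neighbourhood_name": [
            "Marpole",
            "Kerrisdale",
            "Marpole",
            "Kerrisdale",
            "Kerrisdale",
        ],
        "diameter": [24.0, 20.0, 9.0, 1.0, 8.0],
    }
)

# One ordered list of categories per column, sorted by median diameter.
orders = {
    col: toy_df.groupby(col)["diameter"].median().sort_values().index.tolist()
    for col in ["genus_name", "neighbourhood_name"]
}
print(orders)
```

Each list can then be checked on its own, for example to confirm which category has the largest median, before it is passed as a `sort` argument.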
This section explains which five plots I am going to include in my report and how they will be changed for the audience.
1: Plot for question 1. I can add a more explanatory title and subtitle, remove the x axis, and instead show each neighbourhood's tree count beside its bar.
2: Second plot from question 2. It needs better axis labels. A tooltip can be added to the line chart to show that the line marks the mean diameter. It also needs an explanatory title.
3: Plot from question 4. It needs a title.
4: Plots from question 6. The axis titles and plot title need work.
5: Plot from question 9. The y-axis ticks can be changed to integers. The axis titles and plot title need work. For a public audience, I think I will change this plot to one with squares whose size and colour reflect the count of observations. That is probably easier to understand.